Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability
Modern large-scale scientific discovery requires multidisciplinary
collaboration across diverse computing facilities, including High Performance
Computing (HPC) machines and the Edge-to-Cloud continuum. Integrated data
analysis plays a crucial role in scientific discovery, especially in the
current AI era, by enabling responsible AI development, adherence to the FAIR
principles, reproducibility, and user steering. However, the heterogeneous
nature of science poses
challenges such as dealing with multiple supporting tools, cross-facility
environments, and efficient HPC execution. Building on data observability,
adapter system design, and provenance, we propose MIDA: an approach for
lightweight runtime Multi-workflow Integrated Data Analysis. MIDA defines data
observability strategies and adaptability methods for various parallel systems
and machine learning tools. With observability, it intercepts the dataflows in
the background without requiring instrumentation while integrating domain,
provenance, and telemetry data at runtime into a unified database ready for
user steering queries. We conduct experiments showing end-to-end multi-workflow
analysis integrating data from Dask and MLflow in a real distributed deep
learning use case for materials science that runs on multiple environments with
up to 276 GPUs in parallel. We show near-zero overhead running up to 100,000
tasks on 1,680 CPU cores on the Summit supercomputer.
Comment: 10 pages, 5 figures, 2 Listings, 42 references, Paper accepted at IEEE eScience'2
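The core idea above, capturing provenance and telemetry of running tasks in the background and landing everything in one queryable database, can be sketched in a few lines. This is an illustrative stand-in, not MIDA's actual API: MIDA hooks into Dask and MLflow internals without instrumenting user code, whereas this sketch uses an explicit (hypothetical) `observed` decorator so the example stays self-contained.

```python
import functools
import json
import sqlite3
import time


def observed(task_name, db):
    """Hypothetical provenance-capture decorator (illustrative only).

    The wrapped function runs unchanged while its inputs, output, and
    timing are recorded into a unified store, ready for steering queries.
    """
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            t0 = time.time()
            out = fn(*args, **kwargs)
            db.execute(
                "INSERT INTO provenance(task, inputs, output, seconds) "
                "VALUES (?, ?, ?, ?)",
                (task_name, json.dumps([repr(a) for a in args]),
                 repr(out), time.time() - t0),
            )
            return out
        return inner
    return wrap


# Unified in-memory database holding domain + provenance + telemetry data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE provenance(task TEXT, inputs TEXT, "
           "output TEXT, seconds REAL)")


@observed("square", db)
def square(x):
    return x * x


results = [square(i) for i in range(3)]
# A "user steering query" over the captured dataflow:
rows = db.execute("SELECT task, output FROM provenance").fetchall()
```

In the real system, the capture point is an adapter attached to the workflow engine rather than a decorator, which is what makes the interception instrumentation-free.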
Automated Metadata Extraction Can Make Data Swamps More Navigable
In a science utopia, every research repository would be accompanied by a database of rich, searchable metadata that users can quickly and confidently query to discover, retrieve, and organize the many artifacts of research workflows. In practice, science is far from this utopia; repositories commonly decay into disorganized data swamps that overwhelm scientists and leave crucial research data inaccessible to those who could use them. To dredge data swamps, I describe Xtract, an automated metadata extraction system for science that crawls large repositories, dynamically constructs extraction workflows by intelligently mapping extractors to diverse file types, scalably executes these workflows on distributed research cyberinfrastructure, and publishes the derived metadata into a search index. I show via a user study that an Xtract-generated search index drastically increases the speed and confidence with which researchers navigate their science collections. Finally, I highlight the benefits of this approach by applying Xtract to real-world repositories collectively spanning over 6 million files and 1 PB of data across materials science, climate science, battery modeling, and spectroscopy.
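The mapping step described above, choosing an extractor per file type and collecting the derived metadata into an index, can be illustrated with a minimal sketch. The `EXTRACTORS` registry and `crawl_and_extract` function are hypothetical names for illustration, not Xtract's actual interface, and real extractors would be far richer than these one-liners.

```python
import tempfile
from pathlib import Path

# Hypothetical extractor registry keyed by file suffix (illustrative only).
EXTRACTORS = {
    ".csv": lambda p: {"type": "tabular",
                       "columns": p.read_text().splitlines()[0].split(",")},
    ".txt": lambda p: {"type": "free-text",
                       "words": len(p.read_text().split())},
}


def crawl_and_extract(root):
    """Crawl a repository, map each file to a matching extractor,
    and collect the derived metadata into a searchable index."""
    index = {}
    for path in Path(root).rglob("*"):
        extractor = EXTRACTORS.get(path.suffix)
        if path.is_file() and extractor:
            index[str(path)] = extractor(path)
    return index


# Tiny demo repository: one tabular file, one free-text file.
with tempfile.TemporaryDirectory() as d:
    Path(d, "a.csv").write_text("id,temp\n1,300\n")
    Path(d, "notes.txt").write_text("battery cycling notes")
    idx = crawl_and_extract(d)
```

The real system additionally dispatches these extraction workflows across distributed cyberinfrastructure and publishes the index for search, which this local sketch omits.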